Benchmarking Policies
In this tutorial, we walk through testing the accuracy of classifications made by DynamoGuard against your own dataset in CSV format. This will help you understand how your data interacts with our API and provide insights into improving your model's accuracy.
Prerequisites
Before you start, ensure you have the following:
- DynamoAI Access Token
- DynamoAI API URL
- Dataset in the required format (see the Dataset CSV Format Requirement section)
The API URL can be found in the sending POST requests section of the Code Export for policies.
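If you prefer not to hard-code the access token and API URL, one option is to read them from environment variables. This is only a convenience sketch; the variable names `DYNAMOAI_API_URL` and `DYNAMOAI_API_TOKEN` are illustrative, not a DynamoGuard convention:

```python
import os

# Illustrative only: read credentials from environment variables instead of hard-coding them.
# The variable names below are assumptions, not required by DynamoGuard.
api_url = os.environ.get("DYNAMOAI_API_URL", "")
auth_token = os.environ.get("DYNAMOAI_API_TOKEN", "")
```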
Environment Setup
Ensure you have Python installed on your system, then install the necessary libraries, including `pandas`, `numpy`, `aiohttp`, `sklearn` (installed via the `scikit-learn` package), `datasets`, and any others required by the code.
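If you are unsure whether your environment has everything it needs, a quick import check like the sketch below can help. The package list mirrors the imports used later in this tutorial:

```python
import importlib

# Quick sanity check that the packages used in this tutorial are importable.
# Note: the `sklearn` module is provided by the `scikit-learn` package on PyPI.
for pkg in ("pandas", "numpy", "aiohttp", "sklearn", "datasets"):
    try:
        importlib.import_module(pkg)
        print(f"{pkg}: OK")
    except ImportError:
        print(f"{pkg}: missing")
```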
Asynchronous Structure
The benchmark script is designed with an asynchronous structure to enhance efficiency and scalability when sending multiple requests to the Guardrail LLM API. Asynchronous programming allows the script to initiate and manage numerous API calls concurrently, significantly reducing the overall execution time compared to a synchronous approach where each request must be processed sequentially.
This structure is particularly beneficial when dealing with large datasets, as it ensures that the script can handle a high volume of prompts without being bottlenecked by network latency or API response times. By leveraging Python's `asyncio` library and the `aiohttp` client for asynchronous HTTP requests, the script can send out requests, wait for responses, and process the results as they arrive, all in a non-blocking manner.
Utilizing asynchronous programming principles, the `main` function orchestrates the entire process, from reading the dataset and initializing the benchmark instance to executing the accuracy test and saving the results. This approach ensures that the script remains responsive and efficient, even when scaling up to handle extensive datasets or when integrated into larger, more complex systems for automated LLM benchmarking.
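To illustrate the pattern independently of the DynamoGuard API, the minimal sketch below sends several POST requests concurrently with `asyncio.gather`; the URL and payloads are placeholders:

```python
import asyncio
import aiohttp

async def classify(session, url, payload):
    # One request; control is yielded to the event loop while waiting for the response.
    async with session.post(url, json=payload) as resp:
        return await resp.json()

async def classify_all(url, payloads):
    async with aiohttp.ClientSession() as session:
        # All requests are created up front and awaited together, so they run concurrently.
        tasks = [classify(session, url, p) for p in payloads]
        return await asyncio.gather(*tasks)

# Example usage with a placeholder URL:
# results = asyncio.run(classify_all("https://example.com/classify", [{"text": "hello"}, {"text": "world"}]))
```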
Package Import
To begin, import the libraries needed for the benchmarking process:
import os
import time
import numpy as np
import pandas as pd
import asyncio, aiohttp
from datasets import load_dataset
from pprint import pprint
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score, confusion_matrix

# Ignore warnings
import warnings
warnings.filterwarnings("ignore")
Dataset CSV Format Requirement
Your dataset should be in CSV format with at least two columns: Prompt and Label. The Prompt column contains the text to classify, and the Label column contains the ground-truth classification: `safe` or `unsafe`.
Example CSV format:
| Prompt            | Label  |
|-------------------|--------|
| "Example prompt"  | safe   |
| "Another example" | unsafe |
- Prompt: a string containing the text (prompt) to classify.
- Label: a string (`safe` or `unsafe`) giving the ground-truth classification of the prompt.

Note that the column names in your file must match the `input_col` and `label_col` values configured in the benchmark script below.
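Before running the benchmark, it can be worth validating the file. The following is a minimal sketch, assuming a placeholder path `my_dataset.csv` and the column names shown above (adjust them to match your `input_col`/`label_col`):

```python
import pandas as pd

# Placeholder path and column names; adjust to match your dataset.
df = pd.read_csv("my_dataset.csv")

missing = {"Prompt", "Label"} - set(df.columns)
assert not missing, f"Missing required column(s): {missing}"

# Labels must be 'safe' or 'unsafe' (the benchmark lower-cases labels before checking).
invalid = df[~df["Label"].astype(str).str.lower().isin(["safe", "unsafe"])]
assert invalid.empty, f"Found {len(invalid)} rows with invalid labels"

print(f"{len(df)} rows; label counts:\n{df['Label'].str.lower().value_counts()}")
```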
Code Summary
In the initialization phase, we set up the necessary parameters and read the dataset from a CSV file. This step involves specifying the API URL, the policy ID to test, the authentication token for API access, and the filename of the CSV-format dataset described above. The `api_url`, `policy_id`, `model_id`, and `auth_token` arguments used in the code snippet below can be found on the DynamoGuard platform.
First, we define the `DGuardBenchmark` class.
- After loading the dataset, the main function initializes an instance of the `DGuardBenchmark` class with these parameters. This instance is used to send asynchronous requests to the Guardrail LLM API to classify the prompts, test the accuracy of the classifications against the ground-truth labels, generate a report of the results, and finally save the report to a CSV file for further analysis.
- The `send_request` asynchronous method sends a single classification request to the DynamoAI API and captures the prediction. It ensures that the label is valid and handles any errors during the request.
- The `get_predictions` method orchestrates the sending of asynchronous requests for the entire dataset. It gathers the responses and records the predictions alongside the expected labels used for the accuracy calculation.
- Finally, the `report_results` and `save_csv` methods generate a classification report and save the results, including the predictions and their accuracy, to a CSV file.
class DGuardBenchmark:
    def __init__(self, api_url, policy_id, auth_token, model_id=None, saving_title="results"):
        self.api_url = api_url
        # Accept either a single policy ID or a list of policy IDs
        if isinstance(policy_id, str):
            self.policy_id = [policy_id]
        elif isinstance(policy_id, list):
            self.policy_id = policy_id
        self.model_id = model_id
        if 'chat/session_id' in self.api_url:
            assert self.model_id is not None, "Model ID is required for chat/session_id API endpoint"
        self.headers = {
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {auth_token}',
        }
        self.results = []
        self.stats = []
        self.preds = []
        self.labels = []
        self.saving_title = saving_title

    async def get_predictions(self, prompts, labels, show_report=False):
        """
        Send classification requests for every prompt in the dataset and record
        the predictions. Populates self.preds, self.labels, and self.stats for
        later reporting.
        """
        tasks = []
        # Collect predictions asynchronously
        for prompt, label in zip(prompts, labels):
            tasks.append(self.send_request(prompt, str(label)))
        responses = await asyncio.gather(*tasks)
        # Process responses
        valid_preds = []
        valid_labels = []
        total_cnter = 0
        for i, response in enumerate(responses):
            input_prompt, ground_truth, pred = response
            self.preds.append(pred)
            self.labels.append(ground_truth)
            self.stats.append((input_prompt, ground_truth, pred))
            # Only include non-error predictions in metrics
            if pred != 'error':
                valid_preds.append(pred)
                valid_labels.append(ground_truth)
                total_cnter += 1
    async def send_request(self, prompt, label):
        """Send a single asynchronous request to the DynamoGuard API."""
        # Sanity check: labels are case-insensitive but must be 'safe' or 'unsafe'
        label = label.lower()
        if label not in ['safe', 'unsafe']:
            raise ValueError(f"Invalid label: {label}")
        async with aiohttp.ClientSession() as session:
            if 'chat/session_id' in self.api_url:
                json_data = {
                    'messages': [{'role': 'user', 'content': prompt}],
                    "modelId": self.model_id
                }
            else:
                json_data = {
                    'messages': [{'role': 'user', 'content': prompt}],
                    "textType": "MODEL_INPUT",
                    "policyIds": self.policy_id
                }
            try:
                # Brief pause to avoid overwhelming the API when many requests are in flight
                await asyncio.sleep(0.1)
                async with session.post(self.api_url, headers=self.headers, json=json_data, ssl=False) as response:
                    response_json = await response.json()
                    if 'chat/session_id' in self.api_url:
                        final_action = response_json['analyses'][0]['finalAction']
                    else:
                        final_action = response_json['finalAction']
                    # A final action of NONE means no policy was triggered, i.e. the prompt is safe
                    if final_action == 'NONE':
                        pred = 'safe'
                    else:
                        pred = 'unsafe'
            except Exception as e:
                try:
                    pprint(f"Request failed: {e}. Response received: ")
                    pprint(response_json)
                except:
                    pprint(f"Request failed: {e}. No response received.")
                pred = 'error'
        return prompt, label, pred
    def report_results(self):
        """Generate the classification report and prepare it for CSV export."""
        # Exclude requests that errored out so the report covers only 'safe'/'unsafe' predictions
        valid = [(lbl, prd) for lbl, prd in zip(self.labels, self.preds) if prd != 'error']
        y_true = [lbl for lbl, _ in valid]
        y_pred = [prd for _, prd in valid]
        # Generate classification report as a dictionary
        report = classification_report(y_true, y_pred, target_names=['safe', 'unsafe'], output_dict=True)
        # Keep accuracy for a dedicated row and drop it from the per-class table
        accuracy = report['accuracy']
        report.pop('accuracy')
        # Convert to DataFrame, transpose for easier manipulation, and reset index to make the index a column
        self.report_df = pd.DataFrame(report).transpose().reset_index().round(2)
        # Rename columns to include the title as the first column
        self.report_df.columns = ['Title'] + self.report_df.columns[1:].tolist()
        # Add an empty column to the report DataFrame for alignment
        self.report_df[' '] = ''
        # Insert a divider row
        divider_index = self.report_df[self.report_df['Title'] == 'weighted avg'].index[0] + 1
        divider_row = pd.DataFrame([['---', np.nan, np.nan, np.nan, np.nan, '']], columns=self.report_df.columns, index=[divider_index])
        self.report_df = pd.concat([self.report_df.iloc[:divider_index], divider_row, self.report_df.iloc[divider_index:]]).reset_index(drop=True)
        # Add one more row for the accuracy
        accuracy_row = pd.DataFrame([['accuracy', np.nan, np.nan, np.nan, f"{accuracy:.2f}", np.nan]], columns=self.report_df.columns)
        self.report_df = pd.concat([self.report_df, accuracy_row]).reset_index(drop=True)
    def save_csv(self):
        """Save the report and results to CSV, with the report at the top-left."""
        # Convert stats to DataFrame
        stats_df = pd.DataFrame(self.stats, columns=["Prompt", "Ground Truth", "Prediction"])
        # Prepare an empty DataFrame to align the report with stats
        empty_rows_needed = max(0, len(stats_df) - len(self.report_df))
        empty_df_for_alignment = pd.DataFrame(np.nan, index=range(empty_rows_needed), columns=self.report_df.columns)
        # Concatenate the empty DataFrame with the report for vertical alignment
        aligned_report_df = pd.concat([self.report_df, empty_df_for_alignment], ignore_index=True)
        # Now concatenate the aligned report with the stats horizontally
        final_df = pd.concat([aligned_report_df, stats_df], axis=1)
        # Save to CSV
        filename = f"{self.saving_title}_{time.strftime('%Y%m%d-%H%M%S')}.csv"
        print(f'*** Saving results to CSV with filename: {filename} ***')
        final_df.to_csv(filename, index=False)
The main function demonstrates this process step by step, ensuring a streamlined workflow from dataset preparation to result analysis.
async def main():
    '''
    api_url: str, the endpoint of the Guardrail LLM API
      - At this moment (Apr. 17, 2024), the endpoint should be either
        `https://api.dynamo.ai/v1/moderation/analyze/` (to specify the policy/policies to test) or
        `https://api.dynamo.ai/v1/moderation/chat/session_id` (to test all deployed policies)
    policy_id: str or list, the policy ID(s) to test against.
      - When specifying policies to benchmark, please use the `/analyze/` endpoint.
      - If policy_id is a list, the benchmark will be conducted for all policies in the list.
    model_id: str, the model ID to test against.
    auth_token: str, authentication token for API access
    filename: str, path to the CSV file (or dataset) containing the benchmark data
    '''
    api_url = "https://api.uat.dynamo.ai/v1/moderation/analyze/"
    policy_id = "<policy-id>"    # The specific policy ID to test against
    model_id = None
    auth_token = "<auth-token>"  # Your DynamoAI access token
    filename = "dynamoai-ml/v2408_FinancialAdvice-benchmark-outguard-data_Meta-Llama-3.1-70B-Instruct-Turbo"
    input_col = 'prompt'
    label_col = 'label'

    # Read prompts from a Hugging Face dataset if possible, otherwise from a local CSV
    try:
        data = load_dataset(filename)['train']
    except:
        data = pd.read_csv(filename)
    try:
        prompts = data[input_col]
        labels = data[label_col]
    except:
        warnings.warn("'prompt' and 'label' extraction led to a key error. Assuming this is an output guardrail dataset with 'response' and 'response_label'")
        input_col = "response"
        label_col = "response_label"
        prompts = data[input_col]
        labels = data[label_col]
    if type(prompts) != list:
        prompts = data[input_col].tolist()
    if type(labels) != list:
        labels = data[label_col].tolist()

    benchmark = DGuardBenchmark(
        api_url=api_url,
        policy_id=policy_id,
        auth_token=auth_token,
        model_id=model_id,
        saving_title=filename.split('/')[-1].split('.')[0] + '_results'
    )
    await benchmark.get_predictions(prompts, labels)
    benchmark.report_results()
    benchmark.save_csv()

await main()
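The trailing `await main()` works in a notebook environment such as Jupyter, where an event loop is already running. In a standalone script, run the coroutine with `asyncio.run` instead:

```python
# Standalone-script equivalent of `await main()`:
if __name__ == "__main__":
    asyncio.run(main())
```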
Metrics
In the results CSV file generated by the Guardrail LLM Benchmark tool, several key metrics are included to evaluate the performance of the classifications. These metrics are essential for understanding how well the model performs across different aspects:
- Precision: The ratio of correctly predicted positive observations to the total predicted positives. It shows how many of the positively predicted cases were actually positive.
- Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations in the actual class. It measures the model's ability to capture actual positives.
- F1-score: The harmonic mean of Precision and Recall. The F1-score reflects overall performance and is especially informative when the dataset is imbalanced, for instance when safe prompts greatly outnumber unsafe prompts (or vice versa).
- Support: The number of actual occurrences of the class in the specified dataset. It provides insight into the class imbalance of the dataset.
- Macro Average: The average of the Precision, Recall, and F1-score for each class, without taking the class imbalance into account. It treats all classes equally, calculating metrics for each class independently and then taking the average.
- Weighted Average: Similar to the Macro Average, but each class's metric is weighted by its support. This accounts for class imbalance by giving more weight to the metrics of classes with more instances.
- Accuracy: The proportion of prompts for which DynamoGuard's safe/unsafe prediction matches the ground-truth label.
These metrics together provide a comprehensive overview of the model's performance, highlighting areas of strength and potential improvement. Precision, Recall, and F1-score offer a balanced view of the model's accuracy, while Support indicates the dataset's class distribution, and the averages (Macro and Weighted) give insights into the model's overall effectiveness across different classes.
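If you prefer to compute these metrics programmatically from the benchmark's accumulated predictions rather than reading them from the CSV, the following is a minimal sketch using the scikit-learn functions imported earlier. It assumes `benchmark` is a `DGuardBenchmark` instance after `get_predictions` has run, that no requests errored out, and it treats `unsafe` as the positive class:

```python
# Assumes benchmark.labels and benchmark.preds contain only 'safe'/'unsafe' values.
labels, preds = benchmark.labels, benchmark.preds

print("Accuracy :", accuracy_score(labels, preds))
print("Precision:", precision_score(labels, preds, pos_label="unsafe"))
print("Recall   :", recall_score(labels, preds, pos_label="unsafe"))
print("F1-score :", f1_score(labels, preds, pos_label="unsafe"))

# Confusion matrix with a fixed label order: rows are ground truth, columns are predictions.
tn, fp, fn, tp = confusion_matrix(labels, preds, labels=["safe", "unsafe"]).ravel()
print("False positive rate:", fp / (fp + tn))
print("False negative rate:", fn / (fn + tp))
```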
Conclusion
Following this guide allows you to accurately test the classifications made by the DynamoAI API against your dataset. This process is crucial for understanding the performance of your models and making the necessary adjustments.
Legal Notice Copyright 2024 DynamoAI, Inc. All rights reserved.
This software is provided "as is", without warranty of any kind, express or implied. Use of this software is governed by the Terms of Use between you or your employer and DynamoAI. The above restrictions highlight the prohibition of unauthorized use, distribution, and modification.